pacman::p_load(sf, spdep, tmap, tidyverse,funModeling,blorr,corrplot,ggpubr,GWmodel, skimr, caret, tidyr)Geographically Weighted Logistic Regression- Nigeria Functional and Non Functional Water Points
Setting the scene
To build an exploratory model to discover factor affecting water point status in Osun State, Nigeria.
Study are: Osun State, Nigeria
Data Sets:
Osun.rds, contains LGAs boundaries of Osun State. It is in sf polygon data frame, and
Osun_wp_sf.rds, contains water points within Osun State. It is in sf point data frame.
Model Variables
- Dependent Variables: Water point status(i.e. functional/non-functional)
Independent Variables:
distance_to_primary_road
distance_to_secondary_road
distance_to_tertiary_road
distance_to_city
distance_to_town
water_point_population
local_population_1Km
usage_capacity
is_urban
water_source_clean
Getting Started
Installing R Packages
Using the code chunk, following packages will be installed into R environment
Data Import
In this class exercise, two data sets will be used.They are:
Importing analytical data
First, we are going to import the analytical data into R environment.
Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")Osun_wp_sf %>%
freq(input="status")Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
Please report the issue at <https://github.com/pablo14/funModeling/issues>.

status frequency percentage cumulative_perc
1 TRUE 2642 55.5 55.5
2 FALSE 2118 44.5 100.0
From the above chart, it can interpreted that there are 2642 observation of “Functional water points” and 2118 observations of “Non-Functional Water points”.
Visualizing the water point data using tmap
tmap_mode("view")tmap mode set to interactive viewing
tm_shape(Osun)+
tmap_options(check.and.fix= TRUE)+
tm_polygons(alpha=0.4)+
tm_shape(Osun_wp_sf)+
tm_dots(col= "status",
alpha=0.6)+
tm_view(set.zoom.limits = c(9,12))Exploratory Data Analysis
Summary Statistics with Skimr
Osun_wp_sf%>%
skim()Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
| Name | Piped data |
| Number of rows | 4760 |
| Number of columns | 75 |
| _______________________ | |
| Column type frequency: | |
| character | 47 |
| logical | 5 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1.00 | 5 | 44 | 0 | 2 | 0 |
| report_date | 0 | 1.00 | 22 | 22 | 0 | 42 | 0 |
| status_id | 0 | 1.00 | 2 | 7 | 0 | 3 | 0 |
| water_source_clean | 0 | 1.00 | 8 | 22 | 0 | 3 | 0 |
| water_source_category | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| water_tech_clean | 24 | 0.99 | 9 | 23 | 0 | 3 | 0 |
| water_tech_category | 24 | 0.99 | 9 | 15 | 0 | 2 | 0 |
| facility_type | 0 | 1.00 | 8 | 8 | 0 | 1 | 0 |
| clean_country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| clean_adm1 | 0 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| clean_adm2 | 0 | 1.00 | 3 | 14 | 0 | 35 | 0 |
| clean_adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| clean_adm4 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| installer | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management_clean | 1573 | 0.67 | 5 | 37 | 0 | 7 | 0 |
| status_clean | 0 | 1.00 | 9 | 32 | 0 | 7 | 0 |
| pay | 0 | 1.00 | 2 | 39 | 0 | 7 | 0 |
| fecal_coliform_presence | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| subjective_quality | 0 | 1.00 | 18 | 20 | 0 | 4 | 0 |
| activity_id | 4757 | 0.00 | 36 | 36 | 0 | 3 | 0 |
| scheme_id | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| wpdx_id | 0 | 1.00 | 12 | 12 | 0 | 4760 | 0 |
| notes | 0 | 1.00 | 2 | 96 | 0 | 3502 | 0 |
| orig_lnk | 4757 | 0.00 | 84 | 84 | 0 | 1 | 0 |
| photo_lnk | 41 | 0.99 | 84 | 84 | 0 | 4719 | 0 |
| country_id | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| data_lnk | 0 | 1.00 | 79 | 96 | 0 | 2 | 0 |
| water_point_history | 0 | 1.00 | 142 | 834 | 0 | 4750 | 0 |
| clean_country_id | 0 | 1.00 | 3 | 3 | 0 | 1 | 0 |
| country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| water_source | 0 | 1.00 | 8 | 30 | 0 | 4 | 0 |
| water_tech | 0 | 1.00 | 5 | 37 | 0 | 20 | 0 |
| adm2 | 0 | 1.00 | 3 | 14 | 0 | 33 | 0 |
| adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management | 1573 | 0.67 | 5 | 47 | 0 | 7 | 0 |
| adm1 | 0 | 1.00 | 4 | 5 | 0 | 4 | 0 |
| New Georeferenced Column | 0 | 1.00 | 16 | 35 | 0 | 4760 | 0 |
| lat_lon_deg | 0 | 1.00 | 13 | 32 | 0 | 4760 | 0 |
| public_data_source | 0 | 1.00 | 84 | 102 | 0 | 2 | 0 |
| converted | 0 | 1.00 | 53 | 53 | 0 | 1 | 0 |
| created_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| updated_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| Geometry | 0 | 1.00 | 33 | 37 | 0 | 4760 | 0 |
| ADM2_EN | 0 | 1.00 | 3 | 14 | 0 | 30 | 0 |
| ADM2_PCODE | 0 | 1.00 | 8 | 8 | 0 | 30 | 0 |
| ADM1_EN | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| ADM1_PCODE | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| rehab_year | 4760 | 0 | NaN | : |
| rehabilitator | 4760 | 0 | NaN | : |
| is_urban | 0 | 1 | 0.39 | FAL: 2884, TRU: 1876 |
| latest_record | 0 | 1 | 1.00 | TRU: 4760 |
| status | 0 | 1 | 0.56 | TRU: 2642, FAL: 2118 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 0 | 1.00 | 68550.48 | 10216.94 | 49601.00 | 66874.75 | 68244.50 | 69562.25 | 471319.00 | ▇▁▁▁▁ |
| lat_deg | 0 | 1.00 | 7.68 | 0.22 | 7.06 | 7.51 | 7.71 | 7.88 | 8.06 | ▁▂▇▇▇ |
| lon_deg | 0 | 1.00 | 4.54 | 0.21 | 4.08 | 4.36 | 4.56 | 4.71 | 5.06 | ▃▆▇▇▂ |
| install_year | 1144 | 0.76 | 2008.63 | 6.04 | 1917.00 | 2006.00 | 2010.00 | 2013.00 | 2015.00 | ▁▁▁▁▇ |
| fecal_coliform_value | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| distance_to_primary_road | 0 | 1.00 | 5021.53 | 5648.34 | 0.01 | 719.36 | 2972.78 | 7314.73 | 26909.86 | ▇▂▁▁▁ |
| distance_to_secondary_road | 0 | 1.00 | 3750.47 | 3938.63 | 0.15 | 460.90 | 2554.25 | 5791.94 | 19559.48 | ▇▃▁▁▁ |
| distance_to_tertiary_road | 0 | 1.00 | 1259.28 | 1680.04 | 0.02 | 121.25 | 521.77 | 1834.42 | 10966.27 | ▇▂▁▁▁ |
| distance_to_city | 0 | 1.00 | 16663.99 | 10960.82 | 53.05 | 7930.75 | 15030.41 | 24255.75 | 47934.34 | ▇▇▆▃▁ |
| distance_to_town | 0 | 1.00 | 16726.59 | 12452.65 | 30.00 | 6876.92 | 12204.53 | 27739.46 | 44020.64 | ▇▅▃▃▂ |
| rehab_priority | 2654 | 0.44 | 489.33 | 1658.81 | 0.00 | 7.00 | 91.50 | 376.25 | 29697.00 | ▇▁▁▁▁ |
| water_point_population | 4 | 1.00 | 513.58 | 1458.92 | 0.00 | 14.00 | 119.00 | 433.25 | 29697.00 | ▇▁▁▁▁ |
| local_population_1km | 4 | 1.00 | 2727.16 | 4189.46 | 0.00 | 176.00 | 1032.00 | 3717.00 | 36118.00 | ▇▁▁▁▁ |
| crucialness_score | 798 | 0.83 | 0.26 | 0.28 | 0.00 | 0.07 | 0.15 | 0.35 | 1.00 | ▇▃▁▁▁ |
| pressure_score | 798 | 0.83 | 1.46 | 4.16 | 0.00 | 0.12 | 0.41 | 1.24 | 93.69 | ▇▁▁▁▁ |
| usage_capacity | 0 | 1.00 | 560.74 | 338.46 | 300.00 | 300.00 | 300.00 | 1000.00 | 1000.00 | ▇▁▁▁▅ |
| days_since_report | 0 | 1.00 | 2692.69 | 41.92 | 1483.00 | 2688.00 | 2693.00 | 2700.00 | 4645.00 | ▁▇▁▁▁ |
| staleness_score | 0 | 1.00 | 42.80 | 0.58 | 23.13 | 42.70 | 42.79 | 42.86 | 62.66 | ▁▁▇▁▁ |
| location_id | 0 | 1.00 | 235865.49 | 6657.60 | 23741.00 | 230638.75 | 236199.50 | 240061.25 | 267454.00 | ▁▁▁▁▇ |
| cluster_size | 0 | 1.00 | 1.05 | 0.25 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | ▇▁▁▁▁ |
| lat_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| lon_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| count | 0 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
Osun_wp_sf_clean <- Osun_wp_sf%>%
filter_at(vars(status,
distance_to_primary_road,
distance_to_secondary_road,
distance_to_tertiary_road,
distance_to_city,
distance_to_town,
water_point_population,
local_population_1km,
usage_capacity,
is_urban,
water_source_clean),
all_vars(!is.na(.)))%>%
mutate(usage_capacity = as.factor(usage_capacity))After the above code chunk run, it can be observed 4 observations are deleted and now there are total of 4756 observations with 75 columns.
Learnings from the above code chunk are:
exclude missing values (filtering for
all_vars(!is.na(.))); andrecode usage_capacity as factor (it only has 3 classes) instead of numerical data type. This is because the calibration of logit function will be different.
Correlation Analysis
Using the code chunk below, selected row will be filtered from” Osun_wp_sf_clean” data set and geometry column is dropped.
Osun_wp <- Osun_wp_sf_clean%>%
select(c(7,35:39,42,43,46,47,57))%>%
st_set_geometry(NULL)Next, we plot the correlation matrix for all the numerical data fields.
cluster_vars.cor= cor(
Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
lower= "ellipse",
upper= "number",
tl.pos= "lt",
diag= "l",
tl.col= "black")
From the above result, it can observed there are none of the variables that are highly correlated, i.e. correlation greater than +/- 0.8. Therefore, we will consider all the variables for the further analysis.
Building a Logistic Regression Models
Using the code chunk below, regression model is built.
model <- glm(status ~ distance_to_primary_road+
distance_to_secondary_road+
distance_to_tertiary_road+
distance_to_city+
distance_to_town+
is_urban+
usage_capacity+
water_source_clean+
water_point_population+
local_population_1km,
data= Osun_wp_sf_clean,
family= binomial(link= "logit"))Instead of using typical R report, blr_regress() of blorr package is used.
blr_regress(model) Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4744 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.3887 0.1124 3.4588 5e-04
distance_to_primary_road 1 0.0000 0.0000 -0.7153 0.4744
distance_to_secondary_road 1 0.0000 0.0000 -0.5530 0.5802
distance_to_tertiary_road 1 1e-04 0.0000 4.6708 0.0000
distance_to_city 1 0.0000 0.0000 -4.7574 0.0000
distance_to_town 1 0.0000 0.0000 -4.9170 0.0000
is_urbanTRUE 1 -0.2971 0.0819 -3.6294 3e-04
usage_capacity1000 1 -0.6230 0.0697 -8.9366 0.0000
water_source_cleanProtected Shallow Well 1 0.5040 0.0857 5.8783 0.0000
water_source_cleanProtected Spring 1 1.2882 0.4388 2.9359 0.0033
water_point_population 1 -5e-04 0.0000 -11.3686 0.0000
local_population_1km 1 3e-04 0.0000 19.2953 0.0000
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7347 Somers' D 0.4693
% Discordant 0.2653 Gamma 0.4693
% Tied 0.0000 Tau-a 0.2318
Pairs 5585188 c 0.7347
---------------------------------------------------------------
Observations
It can be observed first two observations are more than the significance level of 0.05. Therefore, these variables will be excluded for further analysis as they are not significant.
In estimate column, if estimate is positive then that independent variable has positive correlation with dependent variable and if estimate is negative then that independent variable has negative correlation with dependent variable.
In the code chuck below, blr_confusion_matrix() of blorr package is used to compute the confusion matrix of the estimated outcomes by using 0.5 as the cutoff value.
blr_confusion_matrix(model, cutoff= 0.5)Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1301 738
1 813 1904
Accuracy : 0.6739
No Information Rate : 0.4445
Kappa : 0.3373
McNemars's Test P-Value : 0.0602
Sensitivity : 0.7207
Specificity : 0.6154
Pos Pred Value : 0.7008
Neg Pred Value : 0.6381
Prevalence : 0.5555
Detection Rate : 0.4003
Detection Prevalence : 0.5713
Balanced Accuracy : 0.6680
Precision : 0.7008
Recall : 0.7207
'Positive' Class : 1
The validity of a cutoff is measured using sensitivity, specificity and accuracy.
- Sensitivity: The % of correctly classified events out of all events= TP/(TP+FN)
- Specificity: The % of correctly classified non-events out of all events= TN/(TN+FP)
- Accuracy: The % of correctly classified observation over all observations= (TP+TN)/ (TP+FP+FN+TN)
Observations
From the output, we see that the model gives us an accuracy of 0.668, which is a good start as it is better than guessing (0.5).
The sensitivity and specificity are 0.7207 and 0.6154 respectively. This shows that the true positives (functional water points) are slightly higher than the true negative prediction rates (non-functional water points).
Building Fixed Bandwidth GWR Model
Converting sf data frame to sp data frame
Next, we need to convert the sf data frame into spatial point data frame for GWR model building. This is done using the code chunk below.
Osun_wp_sp <- Osun_wp_sf_clean%>%
select(c(status,
distance_to_primary_road,
distance_to_secondary_road,
distance_to_tertiary_road,
distance_to_city,
distance_to_town,
water_point_population,
local_population_1km,
is_urban,
usage_capacity,
water_source_clean))%>%
as_Spatial()
Osun_wp_spclass : SpatialPointsDataFrame
features : 4756
extent : 182502.4, 290751, 340054.1, 450905.3 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs
variables : 11
names : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean
min values : 0, 0.014461356813335, 0.152195902540837, 0.017815121653488, 53.0461399623541, 30.0019777713073, 0, 0, 0, 1000, Borehole
max values : 1, 26909.8616132094, 19559.4793799085, 10966.2705628969, 47934.343603562, 44020.6393368124, 29697, 36118, 1, 300, Protected Spring
Note: We used cleaned version of data set for consistency in the geometrics with our model building (4 water points with missing values excluded).
Computing Fixed Bandwidth
bw.fixed <- bw.ggwr(status ~
distance_to_primary_road+
distance_to_secondary_road+
distance_to_tertiary_road+
distance_to_city+
distance_to_town+
water_point_population+
local_population_1km+
is_urban+
usage_capacity+
water_source_clean,
data= Osun_wp_sp,
family= "binomial",
approach= "AIC",
kernel= "gaussian",
adaptive= FALSE,
longlat= FALSE)bw.fixedgwlr.fixed <- ggwr.basic(status ~
distance_to_primary_road+
distance_to_secondary_road+
distance_to_tertiary_road+
distance_to_city+
distance_to_town+
water_point_population+
local_population_1km+
is_urban+
usage_capacity+
water_source_clean,
data= Osun_wp_sp,
bw= 2597.255,
family= "binomial",
kernel= "gaussian",
adaptive= FALSE,
longlat= FALSE)Warning in proj4string(data): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Warning in proj4string(regression.points): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Iteration Log-Likelihood
=========================
0 -1957
1 -1675
2 -1525
3 -1441
4 -1403
5 -1403
We look at the results below. Similar to when we build multiple linear regression model, the report has 2 sections - generalised regression (global model) results and geographically weighted (GW) regression results. Note that the global model does not have AICc result, so AIC should be used to compare the 2 models.
gwlr.fixed ***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2022-12-18 00:53:19
Call:
ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road +
distance_to_tertiary_road + distance_to_city + distance_to_town +
water_point_population + local_population_1km + is_urban +
usage_capacity + water_source_clean, data = Osun_wp_sp, bw = 2597.255,
family = "binomial", kernel = "gaussian", adaptive = FALSE,
longlat = FALSE)
Dependent (y) variable: status
Independent variables: distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km is_urban usage_capacity water_source_clean
Number of data points: 4756
Used family: binomial
***********************************************************************
* Results of Generalized linear Regression *
***********************************************************************
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-124.555 -1.755 1.072 1.742 34.333
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Intercept 3.887e-01 1.124e-01 3.459 0.000543
distance_to_primary_road -4.642e-06 6.490e-06 -0.715 0.474422
distance_to_secondary_road -5.143e-06 9.299e-06 -0.553 0.580230
distance_to_tertiary_road 9.683e-05 2.073e-05 4.671 3.00e-06
distance_to_city -1.686e-05 3.544e-06 -4.757 1.96e-06
distance_to_town -1.480e-05 3.009e-06 -4.917 8.79e-07
water_point_population -5.097e-04 4.484e-05 -11.369 < 2e-16
local_population_1km 3.451e-04 1.788e-05 19.295 < 2e-16
is_urbanTRUE -2.971e-01 8.185e-02 -3.629 0.000284
usage_capacity1000 -6.230e-01 6.972e-02 -8.937 < 2e-16
water_source_cleanProtected Shallow Well 5.040e-01 8.574e-02 5.878 4.14e-09
water_source_cleanProtected Spring 1.288e+00 4.388e-01 2.936 0.003325
Intercept ***
distance_to_primary_road
distance_to_secondary_road
distance_to_tertiary_road ***
distance_to_city ***
distance_to_town ***
water_point_population ***
local_population_1km ***
is_urbanTRUE ***
usage_capacity1000 ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6534.5 on 4755 degrees of freedom
Residual deviance: 5688.0 on 4744 degrees of freedom
AIC: 5712
Number of Fisher Scoring iterations: 5
AICc: 5712.099
Pseudo R-square value: 0.1295351
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Fixed bandwidth: 2597.255
Regression points: the same locations as observations are used.
Distance metric: A distance matrix is specified for this model calibration.
************Summary of Generalized GWR coefficient estimates:**********
Min. 1st Qu. Median
Intercept -8.9630e+02 -4.9805e+00 1.7599e+00
distance_to_primary_road -1.9477e-02 -4.8092e-04 3.0174e-05
distance_to_secondary_road -1.5757e-02 -3.7583e-04 1.2438e-04
distance_to_tertiary_road -1.5673e-02 -4.2538e-04 7.6217e-05
distance_to_city -1.8447e-02 -5.6287e-04 -1.2745e-04
distance_to_town -2.2450e-02 -5.7335e-04 -1.5218e-04
water_point_population -5.2830e-02 -2.2810e-03 -9.8829e-04
local_population_1km -1.2757e-01 5.0016e-04 1.0647e-03
is_urbanTRUE -1.9866e+02 -4.3054e+00 -1.6908e+00
usage_capacity1000 -2.0846e+01 -9.7311e-01 -4.1596e-01
water_source_cleanProtected.Shallow.Well -2.0782e+01 -4.5536e-01 5.3278e-01
water_source_cleanProtected.Spring -5.2495e+02 -5.5983e+00 2.5500e+00
3rd Qu. Max.
Intercept 1.2829e+01 1075.4234
distance_to_primary_road 4.8497e-04 0.0143
distance_to_secondary_road 6.0665e-04 0.0259
distance_to_tertiary_road 6.7104e-04 0.0129
distance_to_city 2.3763e-04 0.0155
distance_to_town 1.9318e-04 0.0225
water_point_population 5.0564e-04 0.1313
local_population_1km 1.8177e-03 0.0392
is_urbanTRUE 1.2864e+00 746.9498
usage_capacity1000 3.0334e-01 5.9492
water_source_cleanProtected.Shallow.Well 1.7870e+00 67.5549
water_source_cleanProtected.Spring 6.7736e+00 331.1243
************************Diagnostic information*************************
Number of data points: 4756
GW Deviance: 2792.323
AIC : 4413.603
AICc : 4747.217
Pseudo R-square value: 0.5726785
***********************************************************************
Program stops at: 2022-12-18 00:54:08
Comparing the AIC values of the 2 models, we see that it is lower for the GW regression model at 4,413.603 then for the global regression model at 5,712.09
Model Assessment
Converting SDF into sf data.frame
To assess the performance of the gwLR, firstly, we will convert the SDF object in as data frame by using the code chunk below.
gwr.fixed <- as.data.frame(gwlr.fixed$SDF)Next, we will label that values greater or equal to 0.5 into 1, else 0. The result the logic comparison operation will be saved into a field called most.
gwr.fixed <- gwr.fixed %>%
mutate(most= ifelse(
gwr.fixed$yhat >= 0.5, T, F))Confusion Matrix
Next, we use confusionMatrix() of caret to display the confusion matrix of the GW model using fixed bandwidth method.
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference= gwr.fixed$y)
CMConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1824 263
TRUE 290 2379
Accuracy : 0.8837
95% CI : (0.8743, 0.8927)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7642
Mcnemar's Test P-Value : 0.2689
Sensitivity : 0.8628
Specificity : 0.9005
Pos Pred Value : 0.8740
Neg Pred Value : 0.8913
Prevalence : 0.4445
Detection Rate : 0.3835
Detection Prevalence : 0.4388
Balanced Accuracy : 0.8816
'Positive' Class : FALSE
We see that the accuracy (0.8816 vs 0.66), sensitivity (0.986 vs 0.7207) and specificity (0.9005 vs 0.6154) values have all improved from the non-gwLR global model. By using the gwLR model, we can explain the functional and non-functional water points better now which allows better management of water points through localised strategies (e.g. look at the local neighbourhood regions within Osun state).
Visualizing gwLR
Before we visualise the results of the gwLR model, we clean up the data set for plotting by selecting the relevant data fields (mainly the status column which is the dependent or predicted variable) into a new sf data frame object wp_sf_select in the code chunk below.
wp_sf_select <- Osun_wp_sf_clean %>%
select(c(ADM2_EN, ADM2_PCODE,
ADM1_EN, ADM1_PCODE,
status))We then combine it with gwr.fixed which has the predicted values of the water point status, in the form of probabilities between 0 and 1.
gwr_sf.fixed <- cbind(wp_sf_select, gwr.fixed)The code chunk below is used to create an interactive point symbol map.
tmap_mode("view")tmap mode set to interactive viewing
actual <- tm_shape(Osun) +
tmap_options(check.and.fix = TRUE) +
tm_polygons(alpha = 0.4) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "status",
alpha = 0.6,
palette = "YlOrRd") +
tm_view(set.zoom.limits = c(9, 12))
prob_T <- tm_shape(Osun) +
tm_polygons(alpha = 0.4) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "yhat",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(9, 12))
tmap_arrange(actual, prob_T,
asp = 1, ncol = 2, sync = TRUE)We see that the predictions are largely aligned with the actual status of the water points, in line with the 88% accuracy rate.
Regression Model after excluding the variables that are not significant
Using the below code chunk, “distance_to_primary_road” and “distance_to_secondary_road” are excluded.
modelM <- glm(status ~
distance_to_tertiary_road+
distance_to_city+
distance_to_town+
is_urban+
usage_capacity+
water_source_clean+
water_point_population+
local_population_1km,
data= Osun_wp_sf_clean,
family= binomial(link= "logit"))
blr_regress(modelM) Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4746 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.3540 0.1055 3.3541 8e-04
distance_to_tertiary_road 1 1e-04 0.0000 4.9096 0.0000
distance_to_city 1 0.0000 0.0000 -5.2022 0.0000
distance_to_town 1 0.0000 0.0000 -5.4660 0.0000
is_urbanTRUE 1 -0.2667 0.0747 -3.5690 4e-04
usage_capacity1000 1 -0.6206 0.0697 -8.9081 0.0000
water_source_cleanProtected Shallow Well 1 0.4947 0.0850 5.8228 0.0000
water_source_cleanProtected Spring 1 1.2790 0.4384 2.9174 0.0035
water_point_population 1 -5e-04 0.0000 -11.3902 0.0000
local_population_1km 1 3e-04 0.0000 19.4069 0.0000
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7349 Somers' D 0.4697
% Discordant 0.2651 Gamma 0.4697
% Tied 0.0000 Tau-a 0.2320
Pairs 5585188 c 0.7349
---------------------------------------------------------------
blr_confusion_matrix(modelM, cutoff= 0.5)Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1300 743
1 814 1899
Accuracy : 0.6726
No Information Rate : 0.4445
Kappa : 0.3348
McNemars's Test P-Value : 0.0761
Sensitivity : 0.7188
Specificity : 0.6149
Pos Pred Value : 0.7000
Neg Pred Value : 0.6363
Prevalence : 0.5555
Detection Rate : 0.3993
Detection Prevalence : 0.5704
Balanced Accuracy : 0.6669
Precision : 0.7000
Recall : 0.7188
'Positive' Class : 1
It can be observed that there is not much change in the specificity, sensitivity and accuracy rate.
Determining the fixed bandwidth
bw.fixed_M <- bw.ggwr(status ~ distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
family = "binomial",
approach = "AIC",
kernel = "gaussian",
adaptive = FALSE, # for fixed bandwidth
longlat = FALSE) # input data have been converted to projected CRS(bw.fixed_M)Model Assessment
gwlr.fixedM <- ggwr.basic(status ~
distance_to_tertiary_road+
distance_to_city+
distance_to_town+
water_point_population+
local_population_1km+
is_urban+
usage_capacity+
water_source_clean,
data= Osun_wp_sp,
bw= 2597.255,
family= "binomial",
kernel= "gaussian",
adaptive= FALSE,
longlat= FALSE)Warning in proj4string(data): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Warning in proj4string(regression.points): CRS object has comment, which is lost in output; in tests, see
https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Iteration Log-Likelihood
=========================
0 -2034
1 -1772
2 -1635
3 -1561
4 -1530
5 -1530
gwlr.fixedM ***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2022-12-18 00:54:10
Call:
ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city +
distance_to_town + water_point_population + local_population_1km +
is_urban + usage_capacity + water_source_clean, data = Osun_wp_sp,
bw = 2597.255, family = "binomial", kernel = "gaussian",
adaptive = FALSE, longlat = FALSE)
Dependent (y) variable: status
Independent variables: distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km is_urban usage_capacity water_source_clean
Number of data points: 4756
Used family: binomial
***********************************************************************
* Results of Generalized linear Regression *
***********************************************************************
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-129.368 -1.750 1.074 1.742 34.126
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Intercept 3.540e-01 1.055e-01 3.354 0.000796
distance_to_tertiary_road 1.001e-04 2.040e-05 4.910 9.13e-07
distance_to_city -1.764e-05 3.391e-06 -5.202 1.97e-07
distance_to_town -1.544e-05 2.825e-06 -5.466 4.60e-08
water_point_population -5.098e-04 4.476e-05 -11.390 < 2e-16
local_population_1km 3.452e-04 1.779e-05 19.407 < 2e-16
is_urbanTRUE -2.667e-01 7.474e-02 -3.569 0.000358
usage_capacity1000 -6.206e-01 6.966e-02 -8.908 < 2e-16
water_source_cleanProtected Shallow Well 4.947e-01 8.496e-02 5.823 5.79e-09
water_source_cleanProtected Spring 1.279e+00 4.384e-01 2.917 0.003530
Intercept ***
distance_to_tertiary_road ***
distance_to_city ***
distance_to_town ***
water_point_population ***
local_population_1km ***
is_urbanTRUE ***
usage_capacity1000 ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6534.5 on 4755 degrees of freedom
Residual deviance: 5688.9 on 4746 degrees of freedom
AIC: 5708.9
Number of Fisher Scoring iterations: 5
AICc: 5708.923
Pseudo R-square value: 0.129406
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Fixed bandwidth: 2597.255
Regression points: the same locations as observations are used.
Distance metric: A distance matrix is specified for this model calibration.
************Summary of Generalized GWR coefficient estimates:**********
Min. 1st Qu. Median
Intercept -2.7771e+02 -3.9915e+00 2.9346e+00
distance_to_tertiary_road -2.0066e-02 -3.6016e-04 8.9385e-05
distance_to_city -3.0931e-02 -5.6273e-04 -1.0359e-04
distance_to_town -3.4702e-03 -4.3133e-04 -1.2398e-04
water_point_population -3.5450e-02 -2.0856e-03 -1.1271e-03
local_population_1km -5.8060e-02 4.0342e-04 1.0001e-03
is_urbanTRUE -3.0233e+02 -3.1725e+00 -1.4861e+00
usage_capacity1000 -4.5295e+01 -1.0249e+00 -3.8880e-01
water_source_cleanProtected.Shallow.Well -1.0470e+02 -4.2423e-01 5.9626e-01
water_source_cleanProtected.Spring -7.9160e+02 -5.4086e+00 2.5525e+00
3rd Qu. Max.
Intercept 1.0668e+01 1102.7459
distance_to_tertiary_road 5.3918e-04 0.0140
distance_to_city 1.2672e-04 0.0129
distance_to_town 2.2159e-04 0.0161
water_point_population 1.9400e-04 0.0569
local_population_1km 1.6838e-03 0.0293
is_urbanTRUE 8.9541e-01 739.6369
usage_capacity1000 3.5031e-01 5.9152
water_source_cleanProtected.Shallow.Well 1.8040e+00 52.4657
water_source_cleanProtected.Spring 6.5117e+00 152.2614
************************Diagnostic information*************************
Number of data points: 4756
GW Deviance: 3051.369
AIC : 4499.24
AICc : 4759.621
Pseudo R-square value: 0.5330355
***********************************************************************
Program stops at: 2022-12-18 00:54:46
We see that both gwLR models have lower AIC values than their global model counter parts.
Converting SDF into sf data frame
gwr.fixed_refined <- as.data.frame(gwlr.fixedM$SDF)gwr.fixed_refined <- gwr.fixed_refined %>%
mutate(most = ifelse(
gwr.fixed_refined$yhat >= 0.5, T, F))We similarly call the confusion matrix and statistics using confusionMatrix() of caret in the code chunk below.
gwr.fixed_refined$y <- as.factor(gwr.fixed_refined$y)
gwr.fixed_refined$most <- as.factor(gwr.fixed_refined$most)
CM_refined <- confusionMatrix(data = gwr.fixed_refined$most,
reference = gwr.fixed_refined$y,
positive = "TRUE")
CM_refinedConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1792 302
TRUE 322 2340
Accuracy : 0.8688
95% CI : (0.8589, 0.8783)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7341
Mcnemar's Test P-Value : 0.4469
Sensitivity : 0.8857
Specificity : 0.8477
Pos Pred Value : 0.8790
Neg Pred Value : 0.8558
Prevalence : 0.5555
Detection Rate : 0.4920
Detection Prevalence : 0.5597
Balanced Accuracy : 0.8667
'Positive' Class : TRUE
We see that the accuracy 0.866, sensitivity 0.88 and specificity 0.84) values have all improved from the non-gwLR global model. By using the gwLR model, we can explain the non-functional water points better now which allows better management of water points through localised strategies (e.g. look at the local neighbourhood regions within Osun state).
The performance measures of the 4 logistic regression models are summarised in the table below.
| Performance Measure | Global regression with 10 variables | gwLR with 10 variables | Global regression with 8 variables | gwLR with 8 variables |
|---|---|---|---|---|
| Accuracy | 0.6739 | 0.8837 | 0.6726 | 0.8846 |
| Sensitivity | 0.7207 | 0.9005 | 0.7188 | 0.8986 |
| Specificity | 0.6154 | 0.8628 | 0.6149 | 0.8671 |
We see that the model accuracy and specificity improve very slightly by removing the non-statistically significant variables from the gwLR model, but the sensitivity drops slightly. Nevertheless, as we would be more interested in finding non-functional water points for maintenance etc., the gwLR model with 8 variables would be more useful with a higher specificity.
Visualizing using tmap
gwr_sf.fixed_refined <- cbind(wp_sf_select, gwr.fixed_refined)tmap_mode("view")tmap mode set to interactive viewing
actual <- tm_shape(Osun) +
tmap_options(check.and.fix = TRUE) +
tm_polygons(alpha = 0.4) +
tm_shape(Osun_wp_sf) +
tm_dots(col = "status",
alpha = 0.6,
palette = "YlOrRd") +
tm_view(set.zoom.limits = c(9, 12))
prob_T_refined <- tm_shape(Osun) +
tm_polygons(alpha = 0.4) +
tm_shape(gwr_sf.fixed_refined) +
tm_dots(col = "yhat",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(9, 12))
tmap_arrange(actual, prob_T_refined,
asp = 1, ncol = 2, sync = TRUE)We see that the predictions are largely aligned with the actual status of the water points, in line with the 88% accuracy rate.